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Abstract: When a series of (related) linear models has to be estimated it 
is often appropriate to combine the different data-sets to construct more 
efficient estimators. We use £i-penalized estimators like the Lasso or the 
Adaptive Lasso which can simultaneously do parameter estimation and 
model selection. We show that for a time-course of high-dimensional lin- 
ear models the convergence rates of the Lasso and of the Adaptive Lasso 
can be improved by combining the different time-points in a suitable way. 
Moreover, the Adaptive Lasso still enjoys oracle properties and consistent 
variable selection. The finite sample properties of the proposed methods 
are illustrated on simulated data and on a real problem of motif finding in 
DNA sequences. 
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1. Introduction 

The Lasso [l6[ has attracted a lot of attention for prediction and variable selec- 
tion in linear regression models, including high-dimensional settin gs w here the 
number of covariates is much larger than sample size Hi S, SMIgi, Not 



only has the idea of ^i-penalization shown its success in other models 
but also many extensions of the Lasso in linear regression models have been pro- 
posed, among them are the Fused Lasso [18], the Adaptive Lasso [HI and the 



Relaxed Lasso [11 1 



Also for multivariate regression, penalization estimators have been shown to 



be successful [19|, ll5[ . In many problems there is a natural ordering of the re- 
sponse space: our new methodology and theory are exploiting this fact. If we 
think of time-course data where we observe a response variable over certain 
time-points, the relationship between "neighbouring" time-points is expected to 
be stronger than between more distant time-points. Instead of separately es- 
timating a parameter vector for each time-point, it is often a better strategy 
to combine information across different time-points. By putting an appropriate 
constraint on the parameter vector, we can control certain characteristics, e.g. 



597 



L. Meier and P. Biihlmann/ Smoothing l\-penalized estimators 



598 



the smoothness over time. As an advantage, we may get a more efficient estima- 
tor: By pooling of information, we reduce the variance, typically at the cost of 
some bias, to achieve a lower mean squared error. For multivariate regression, [2] 
use the correlation between the responses to construct an estimator with lower 
mean squared prediction error. In the discussion of 0, the idea of relevance 
weighted likelihood 0] is mentioned [13]. We use this idea for ^-penalized esti- 
mators. By using an estimator which also fits well for neighbouring time-points, 
we can not only get a smoother behaviour of the parameter vector over time, 
but also profit from more efficiency, both in estimation accuracy and in variable 
selection. 

The rest of this article is organized as follows. In Section [2] we introduce 
the Smoothed Lasso estimator and show that it asymptotically reduces the 
bound on the mean squared error compared to the univariate Lasso estimator. 
In Section [3] we apply the smoothing idea to the Adaptive Lasso and variants 
thereof and show that it can consistently select the correct model and has a faster 
convergence rate than the univariate estimator. Simulations follow in Section [4] 
and a real data analysis for motif search in DNA sequences in Section [5j Section 
[5] contains the discussion. All proofs can be found in the Appendix. 

2. Smoothed Lasso 

We first start with the definition of the Smoothed Lasso estimator and then 
study its theoretical properties. 

2.1. Definition 

Assume that we observe data at N different time-points and that at each time- 
point t r , r = 1, . . . , N, we have a linear regression problem of the form 

y{t r )=Xp(tr)+e{t r ), 

where X is the nxp design matrix, y(t r ) <E R™ is the response vector, f3(t r ) £ M p 
is the parameter vector corresponding to time-point t r and s(t r ) € M™ is the 
corresponding error vector: e{t r ), r = 1, . . . , N are assumed to be i.i.d. random 
vectors with i.i.d. components having mean zero and finite variance a 2 . Note 
that the design matrix X does not depend on t r in our setup (but it could) , and 
hence we consider a multivariate linear model. As commonly used for penalized 
estimation, we assume that the columns of X are centered and scaled to have 
empirical column means and column variances 1. 

Remark 2.1. Generalizations of the methodology and theory include that the 
design matrix X depends on t r , i.e. X{t r ), and that the errors have correlated 
components Cov(e(t r )) = E or arise from a dependent, stationary process with 
respect to the time-points. 
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The idea of the Smoothed Lasso is to use an ^-penalty and to suitably 
combine or smooth the information of the different time-points. It is defined as 

AT p 

f3\ n ,w{tr) = argmin ^2w(t s ,t r )\\y(t s ) -XP\\l + X n ^2\(3j\ , (2.1) 

P s=l j=l 

where w(t s , t r ) are weights satisfying 5Z S=1 w (t s , t r ) = 1. A typical choice is 

w(t s , t r ) oc K 

where K(-) is a univariate kernel, i.e. K(x) > 0, K(x) = K(—x), ^ oQ K{x)dx 
= 1, and h is a bandwidth parameter. Thus, the Smoothed Lasso is (3{t r ) — 
(3\ n ,h{t r ), depending on two tuning parameters. 

We can rewrite the weighted optimization problem (|2.1j) as an ordinary Lasso 
problem 

p 

$x n ,h(t r ) = argmin \\y(t r ) - X[3\\ 2 2 + \ n V \(3 \ , (2.2) 
p 

where 

N 
s=l 

Hence, any algorithm to solve a standard Lasso problem can be used to calculate 
the Smoothed Lasso estimator for a given bandwidth h. 

By forcing the estimate (3(t r ) to fit also well for "neighbouring" time-points, a 
smooth (non-parametric) trend of (3(t r ) as a function of time is usually inherited 
(if p < n this is always true because of strict convexity with respect to j3 and 
continuity with respect to t r of the criterion in (|2.2[) V 

Remark 2.2. Another approach would be to use a Fused Lasso penalty which 
also penalizes the absolute value of the differences between neighbouring time- 
points, i.e. \(3j(t r ) — j3j(tf.-i)\. Such an approach has two drawbacks: First, we 
have to model all time-points simultaneously, i.e. fit a model with Np parame- 
ters. Moreover, the Fused Lasso problem is more difficult to solve than the Lasso 
problem. In our approach we fit N Lasso problems with p parameters each. Sec- 
ond, if we want to mimick the behaviour of a bandwidth which is locally adaptive 
to the underlying true parameter function (3{t), we have to introduce a lot of 
tuning parameters for the Fused Lasso and search over a high- dimensional grid 
when doing cross-validation. 

Remark 2.3. We do not assume that the active set (the set of predictors with 
nonzero coefficients) stays the same over all time-points. Our methodology allows 
for the fact that some predictors enter or leave the active set along the time- 
course. 

In the next Section we first consider the special case of an orthogonal design 
and indeed, we show that the mean squared error is decreased asymptotically. 
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2.2. Orthogonal Case 

We consider the situation where the number of parameters equals the number 
of observations and the design matrix is orthogonal, i.e. X T X = nl n , and 
where the errors e(t r ) are Gaussian. We focus on a single time-point of interest 
and therefore omit the time-index for notational simplicity. In [a, Theorem 1] 
it is shown that the (univariate) soft-threshold estimator (3st (with threshold 
parameter <Ty/2log(n)/n), which is equivalent to the Lasso in the orthogonal 
case (with penalty parameter A = 2<ry / 2 log(n)n), satisfies 

n08T - £111] < (21og(n) + 1) |^ + E min (Pi } ( 2 - 3 ) 

for all [3 £ M. n and that this bound is asymptotically sharp in a minimax-sense 
[H, Theorem 3]. If the non-zero /3j's are not of too low order (i.e. |/3j| 3> n^ 1 / 2 ), 
the order of this bound is log(n) \A n \ /n, where A n = {i \ fi% ^ 0} denotes the 
active set of the time-point of interest. Even though we restrict ourselves to a 
class of parameter vectors which stay out of the rt~ 1,/2 -range, the order of the 
bound in (j2.3|) is sharp because the maximal risk is attained for an element of 
this class (see the proof of Theorem 3 in [f|). 

The order log(n) |*4„|/n can be decreased by the smoothed estimator as 
shown in Proposition ^. 11 



Proposition 2.1. Assume Gaussian errors e(t r ) and the regularity conditions 
(RC\B) - (RC\$ described in Appendix\Xj\ For h = h n x log(n) 1 ' ' 5 'n' 1 ^ 'N' 1 / 5 
and N = N n such that Nh — > oo (n — > oo) the risk of the Smoothed Lasso from 
12. 2\) asymptotically satisfies: for \ n — 2o(Nh)~ 1 / 2 yj2 log(n)n 

n\Px n M n -P\\l] < Clog^KI/CniV^) 

x log(n) 4 / 5 |X|/(n 4/5 iV 4/5 ), n^oo. 
for all [3 £ K™ and some constant C . 
A proof is given in Appendix IA.21 

For a faster convergence rate than log(n) \A n \ /n (for the unsmoothed Lasso) 
we require Nh to converge to infinity which implies that 

( n \ 1/4 
N = N n » —— 
\log{n)J 

i.e. N can be of much lower order than n for achieving a faster convergence rate 
for the minimax bound. 



2.3. General Case 

Let us now consider the general case, i.e. we do not restrict ourselves to an 
orthogonal design matrix. In particular, we allow for high-dimensional situations 
where p — p n ^> n is increasing very fast as n — > oo . 
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Using the results in [131 ] for a fixed design, the univariate Lasso estimator 
satisfies under regularity conditions on the design matrix 

\\hasso,x n - /3||! < Op (a 2 ^ log( Pn )) + O , n ^ oo, 

where m\ = Cn 2 j\ 2 n for some constant C. In a certain sense, this bound is 
tight, see 13J, Remark 1]. We can choose 

and arrive at the optimal rate 

0Las S o,x n -P\\t< Op (an- 1 ' 2 log(p„) 1/2 IAJ 172 ) , n -> oo. (2.4) 

For proving such a result, assumptions on the d esig n matrix are crucial: Various 
authors use different conditions, cf. [HI, 0, [2^, |22| . We refer the reader to [l3| 
for a detailed description of the regularity conditions for 



Proposition 2.2. Assume that the univariate Lasso satisfies {2.1$ and denote 
the bound on the right hand side of {2.$ by a n — an^ 1 ' 2 log(p n ) 1 ' 2 lAn] 1 ' 2 . 
Furthermore, assume the regularity conditions (R(J]ty - (R(^ described in Ap- 

pendix \A.l[ Then, if N — N n 3> |-4 n | l74 an 174 and for some suitable X n and 
h = h n : 

\\P\n,hn - PWt = °p( a n), 

i.e. the Smoothed Lasso has a faster convergence rate than the (tight) bound in 
\2.J$ for the Lasso. 

A proof is given in Appendix IA. 21 Suitable choices for A n and h n in Propo- 
sition 12.21 are 

K^N-^\A n \- 2/9 a 2 J» 

and 

A„ X a 1 / 2 (iV/,)- 1 /4 n 3/4 log(pii) l/4 |Ar l/4_ 

Using the notation that is introduced at the beginning of the proof of Proposition 
12.11 one can derive other asymptotic properties by linking known results for the 
Lasso [f| 0, [2(| with the smoothed model 

y = Xf3 + e 

and an analysis of the bias term \\[3 — (3\\ q for q e {1, 2} as in (|A.3|) . 

Up to now we only considered the estimation error for [3 and no variable 
selection properties. The smoothing reduces the variance and thus it can be 
expected that the Smoothed Lasso selects more (noise) variables than its uni- 
variate counterpart. Empirical evidence of this property is given in Section 2J 
This problem can be overcome by a second stage which removes many of the 
coefficients whose estimates are close to zero. In fact, already the case with a 
univariate response often requires such a second stage for consistent variable 
selection [24j. We will treat a special case in the next Section. 
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3. Smoothed Adaptive Lasso 

The Adaptive Lasso [24[ weights the penalty for the different coefficients using 
an initial estimator pi n u, i.e. 

pf™ t] = argmin \\y - Xpf 2 + A„ £ f 3 \P 3 \ , 

where fj = l/\Pi n it,j\ y for some 7 > are weights based on the initial estimator 
/3i n it ■ For simplicity we will restrict ourselves to 7 = 1 . In [24| , the ordinary least 
squares (OLS) estimator is used for Pi n u- here, we will mainly use the Lasso 
and versions thereof. Through a re-scaling of the columns of the design matrix, 
the Adaptive Lasso estimator can be formulated as an ordinary Lasso problem, 
see (H- 

We can also apply the smoothing technique of Section [5] to the Adaptive 
Lasso. In the smoothed case we again replace the residual sum of squares in the 
objective function with its smoothed counterpart in (|2.2p . i.e. 



PtTHtr) = argmin \\y(t r ) - X(if 2 + X n ]T fj \Pj\. (3.1) 





In [24|, an asymptotic oracle result for the Adaptive Lasso is given for fixed 
dimension p. We show that the Smoothed Adaptive Lasso has a faster conver- 
gence rate. Again, as we focus on a single time-point, we omit the time-index 
for notational simplicity. 

We will consider the situation where the number of variables p is kept fixed 
as n — > 00. As before, let A be the active set of the true parameter vector at 
the current time-point and A n be its empirical counterpart. 

Theorem 3.1. Assume a fixed design with linin^oo -^X T X = C for some 
positive definite matrix C. If Pi n u ~ P = Op{a~ 1 ) for some sequence a n — > 
00, \ n \fN n h n /y/n. — ► 0, A„ a n \J N n h n / yfh — > oo, h n = o{n~ 1 / b N n ~ 1 ^) and 
N n h n — > oo (n — ► oo), then the Smoothed Adaptive Lasso in h3.1\) satisfies 
under the regularity conditions (R<J^) - (i?(Q) described in Avvendix \A.l\ 

lim P(A n = A) = 1 

n—>oc 

and 

V^^GS&EU " M ^ N(0, cjUCaa)- 1 ), n -> oo, 

where a 2 = a 2 K 2 (x)dx, Pa, Pa are the sub-vectors of p, P and Caa * s the 
sub-matrix of C corresponding to the active set A. 

A proof is given in Appendix IA. 21 Thus, if the initial estimator is consistent, 
we can find a sequence A n such that the Smoothed Adaptive Lasso has the 
property of consistent model selection and asymptotic normality on the active 
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set A. Furthermore, if N = N n ^> n 1 / 4 , we can choose h = h n = o(7i _1 / 5 iV rl 
such that Nh — * oo. Thus, as already pointed out in Section FZ72\ a relatively 
small value of N = N n is sufficient for achieving an improved convergence rate. 

Remark 3.1. The optimal convergence rate n~ 2 / 5 N~~ 2 / 5 in Theorem \3.1\ can 
be achieved using h n x n~ 1 / 5 N~ 1 / 5 . Then, the limiting normal distribution 
becomes J\f(Bj^, o-^Caa)^ 1 ) for some vector Bj± with < |-E>.4j| < oo for all 
j. This is the same distribution as when using local least squares (with kernel 
K). Hence, the Smoothed Adaptive Lasso has an oracle property saying that it 
is asymptotically as good as local least squares with the true underlying active 
set A known beforehand. 

3.1. Choice of initial estimator 

The choice of the initial estimator will influence the final estimator. In particular, 
the sparsity of the final estimator can be maximized by making an appropriate 
choice, as discussed below. 

We will first focus on univariate estimators, i.e. on estimators which only use 
the data of the current time-point. In view of Theorem l3.1[ the basic assumption 
for the initial estimator is consistency. The ordinary least squares (OLS) method 
is a possible choice for low-dimensional problems with fixed dimension p as 
it is Y^n-consistent. The Lasso is consistent in an ^2-sense, even in the high- 
dimensional setting, see Section l2T3l Finally, the Adaptive Lasso is y^-consistent 
for fixed p [24| and consistent under suitable regularity conditions for p 3> n [§] . 
For high-dimensional problems the OLS estimator is not appropriate because 
it is unstable or even not defined in a p > n situation. The Lasso or Adaptive 
Lasso are more appropriate choices. 

If the initial estimator is doing variable selection, i.e. some of the coefficients 
Pinit.j = 0, the smoothed estimator is at least as sparse as the initial estimator: 
a zero-coefficient in the initial estimator, i.e. 0i n it i j — 0, results in an infinite 
penalty for that component, i.e. fj = oo, forcing the smoothed estimate to 
be zero, i.e. $j(t r ) = 0. This reduces the computational complexity for the 
smoothing stage since some or even many predictors can be excluded from the 
model. 

For the case that the initial estimator has a tuning parameter, as with the 
Lasso and the Adaptive Lasso, one would in practice tune it to be prediction 
optimal. For the Lasso, this produces too large models, i.e. many noise variables 
are included in the selected model However, noise variables tend to have 
small coefficients and will therefore be heavily penalized in the second smoothing 
step of the Smoothed Adaptive Lasso. 

It is of course also possible to use a smoothed estimator as initial estimator, 
e.g. the Smoothed Lasso. In terms of the number of selected variables, as we 
will see in Section HJ this is often worse than directly using the univariate coun- 
terpart. Due to the reduced variance, the smoothed initial estimator tends to 
select too many variables and not all of them will be eliminated in the second 
stage of the Smoothed Adaptive Lasso. 



L. Meier and P. Biihlmann/ Smoothing l\-penalized estimators 



604 




Fig 1. Parameter functions for model 1 (left) and model 2 (right). 



In view of some empirical results in Section [H we advocate the following: 
the initial estimator for the Smoothed Adaptive Lasso is the univariate Adap- 
tive Lasso; the latter itself uses the univariate Lasso as initial estimator. This 
amounts to be a three-stage procedure where all of the estimations are tuned 
to be prediction optimal using e.g. some cross-validation scheme. There is sub- 
stantial agreement by now that two or more stages are needed to achie ve g ood 
regularization properties in high-dimensional settings 24, 11, 25, Tj^, 21, lpf . As 
a novelty here, our third stage involves an additional smoothing operation. 



4. Simulations 

In this Section we want to evaluate the finite sample properties of the proposed 
estimators. 



4- 1. Design 



We consider the following models, similar to [24j : 
Model 1: Some large effects 

(3(t) = (0.45*, 3sin(t), 3cos(t-3), 0, . . . , 0) 

Model 2: Many small effects 

(3{t) = (0.85 + 0.5 sin(f), 0.85 + 0.5 cos(i), 0.85 + 0.5 sin(* - 1), 
0.85 + 0.5 cos(i - 1), ... , 0.85 + 0.5 sm(t - 3), 
0.85 + 0.5cos(>-3), 0,...,0) 

Figure [1] illustrates the two parameter vectors as a function of time t. We use 
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an equidistant grid on the interval [0, 2n], i.e. t r = (r — l)-^ 2 ^, r € {1, . . . , JV}, 
where N = 18. The design matrix X is simulated from a multivariate nor- 
mal distribution with mean zero and covariancc matrix E id = 0.5^1. The 
standard deviation of the error term is chosen from a £ {2, 4} which corre- 
sponds to a signal-to-noise ratio (averaged over N) of approximately {2.7, 0.7} 
and {3.8, 0.9} for model 1 and model 2, respectively. We use both a "classical" 
setup with n — 50, p — 8 and a high-dimensional setup with n — 100, p = 1000. 

The best combination of bandwidth h and penalization parameter A is being 
searched on a two-dimensional grid using an independent validation set of half 
the size of the training set. This is done independently for each time-point 
which means that we allow for a locally varying bandwidth. The density of 
the standard normal distribution is used as kernel function K{-) for the weight 
function «;(•, •), see Section |2~T1 

For the (Smoothed) Adaptive Lasso with (Smoothed) Lasso as initial estima- 
tor, we first determine the optimal penalization parameter for the initial esti- 
mator and keep it fixed when searching for the optimal penalization parameter 
and bandwidth for the final estimator. 

All estimators which we compare are listed in Table [TJ 

Table 1 
Different estimators 

Univariate Estimators Smoothed Estimators 

1. Lasso 4. Lasso 

2. Adapt. Lasso with OLS 5. Adapt. Lasso with smoothed OLS 

3. Adapt. Lasso with univ. Lasso 6. Adaptive Lasso with 4. 

7. Adaptive Lasso with 3. 



4-2. Performance Measures 

To measure the goodness of fit and the ability to pick the model of the correct 
size we define the following performance measures. 
For the mean squared error we use 

1 N 

r=l 

Moreover, we also report the mean squared prediction error for the regression 
function x T (3(t r ) 

MSEp - 1 (j20o(t r ) - P (U)) 2 + 0(t r ) - p{t r )) T nKt r ) - p(t r ))j , 

where /3o(t r ) is the intercept term (and /3o(t r ) — for our simulations) and X is 
the covariance matrix of the covariates. 
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For the number of variables we define the mean model size 

1 N 

MSize=-J2\Mtr)\ 

r=l 

and the mean number of false positives 

^ N p 

FF= ^EE 1 [ft(Wo] 1 [ft(' r Hl- 

r=l j = l 

In applied sciences where (possibly expensive) experiments arc conducted to 
verify the selected variables (e.g. in biology), the number of false positives is a 
crucial quantity one wants to minimize in order to keep the costs low. 

4-3. Results 

The results can be found in Tabled For the high-dimensional setting we did not 
consider OLS initial estimators. Several conclusions can be made. Let us first 
focus on the Lasso estimator. In all simulation settings, smoothing improves the 
MSEp score substantially. The downside for the Smoothed Lasso estimator is 
that due to the decreased variance, more noise variables tend to enter the model 
which results in larger selected models (with more false positives) than for the 
univariate Lasso estimator. However, in practice one would assign a variable 
importance score to each coefficient and therefore concentrate first on those 
with the largest contributions, whereas many of the false positives have small 
importance scores only. 

Also for the Adaptive Lasso, the MSEp scores get decreased by smoothing 
in all simulation settings. Using a smoothed initial estimator leads to too large 
models. Take for example Adaptive Lasso with Smoothed Lasso as initial esti- 
mator, i.e. proposal 6 in Table [1] As we have described above, the Smoothed 
Lasso tends to select a too large initial model. Although the Adaptive Lasso 
can eliminate most noise variables in the second stage due to their large weights 
from small coefficients of the initial estimator, the resulting models still get a bit 
too large. However, the estimator is very competitive with respect to prediction 
performance. 

Using a univariate initial estimator, i.e. our proposal 7 in Table [TJ to get 
more reasonably sized models seems to be a good compromise. It does not only 
produce the sparsest models but is often also competitive with respect to MSEp 
and MSEp. 

5. Real data: Motif Finding in DNA Sequences 

We apply the smoothing methodology to a problem of motif regression A 
motif (typically a 5-15 letter word consisting of letters A, C, G, and T) is 
a candidate of a binding site of some functional element, e.g. a transcription 



Table 2. Mean values of the different performance measures based on 100 simulation runs for n = 50, p = 8 (top) and n = 100, p = 1000 (bottom). 
Standard deviations are given in parentheses. The low- dimensional case of model 2 can't have false positives because all variables are active. 





MSEp 


MSEp 


MSize 


FP 


MSEp 


MSEp 


MSize 


FP 


Model 1 




a 


= 2 






a 


= 4 




1. 


0.83 (0.17) 


0.65 (0.10) 


5.67 (0.50) 


2.96 (0.47) 


3.14 (0.67) 


2.50 (0.41) 


5.10 (0.57) 


2.69 (0.49) 


2. 


0.71 (0.17) 


0.57 (0.10) 


4.39 (0.50) 


1.82 (0.46) 


3.03 (0.71) 


2.43 (0.42) 


4.13 (0.51) 


1.89 (0.44) 


3. 


0.69 (0.15) 


0.56 (0.09) 


4.30 (0.48) 


1.72 (0.44) 


3.05 (0.72) 


2.43 (0.44) 


4.02 (0.55) 


1.79 (0.46) 


4. 


0.54 (0.14) 


0.43 (0.09) 


6.26 (0.49) 


3.48 (0.48) 


1.93 (0.50) 


1.51 (0.32) 


6.09 (0.61) 


3.42 (0.56) 


5. 


0.46 (0.13) 


0.38 (0.08) 


4.85 (0.50) 


2.15 (0.48) 


1.75 (0.46) 


1.39 (0.30) 


4.89 (0.59) 


2.34 (0.54) 


6. 


0.47 (0.13) 


0.38 (0.09) 


4.80 (0.51) 


2.11 (0.49) 


1.80 (0.48) 


1.43 (0.33) 


4.83 (0.61) 


2.29 (0.54) 


7. 


0.51 (0.14) 


0.41 (0.09) 


4.06 (0.47) 


1.50 (0.44) 


2.29 (0.68) 


1.73 (0.45) 


3.78 (0.52) 


1.58 (0.43) 



Model 2 




a 


= 2 




a 


= 4 


1. 


1.10 (0.19) 


0.86 (0.13) 


7.60 (0.17) 


3.22 (0.42) 


3.06 (0.62) 


6.52 (0.40) 


2. 


1.33 (0.23) 


0.97 (0.15) 


7.35 (0.25) 


4.10 (0.56) 


3.62 (0.71) 


5.90 (0.52) 


3. 


1.30 (0.22) 


0.95 (0.15) 


7.26 (0.24) 


3.99 (0.53) 


3.54 (0.70) 


5.79 (0.47) 


4. 


0.51 (0.12) 


0.42 (0.08) 


7.92 (0.08) 


1.42 (0.34) 


1.29 (0.33) 


7.62 (0.22) 


5. 


0.60 (0.14) 


0.47 (0.10) 


7.64 (0.21) 


1.72 (0.43) 


1.46 (0.37) 


7.11 (0.35) 


6. 


0.64 (0.14) 


0.49 (0.10) 


7.60 (0.21) 


1.84 (0.45) 


1.53 (0.40) 


6.99 (0.39) 


7. 


0.84 (0.20) 


0.61 (0.15) 


7.14 (0.28) 


2.94 (0.57) 


2.54 (0.79) 


5.62 (0.48) 



Model 1 
1. 
3. 
4. 
6. 
7. 



2.16 (0.36) 1.48 (0.23) 

1.07 (0.30) 0.76 (0.18) 

1.05 (0.18) 0.78 (0.12) 

0.48 (0.11) 0.40 (0.09) 

0.74 (0.29) 0.49 (0.14) 



27.35 (5.13) 
7.11 (1.93) 
41.51 (6.62) 
12.74 (3.44) 
5.10 (1.32) 



24.92 (5.11) 
4.79 (1.89) 
38.87 (6.61) 
10.14 (3.42) 
2.79 (1.28) 



5.67 (0.70) 4.22 (0.61) 

4.73 (0.84) 3.36 (0.59) 

3.12 (0.50) 2.41 (0.40) 

2.25 (0.50) 1.81 (0.33) 

3.80 (0.90) 2.47 (0.57) 



19.83 (4.86) 
6.07 (2.08) 
42.94 (7.01) 
17.66 (4.34) 
4.74 (1.53) 



18.05 (4.80) 
4.51 (2.02) 
40.54 (6.99) 
15.36 (4.34) 
3.19 (1.47) 



Model 2 cr = 2 

T. 1.37 (0.23) 1.69 (0.34) 32.90 (6.13) 25.71 (6.17) 

3. 1.52 (0.22) 1.22 (0.23) 12.13 (2.16) 5.82 (2.14) 

4. 0.50 (0.08) 0.51 (0.09) 46.56 (8.63) 38.60 (8.62) 

6. 0.51 (0.09) 0.44 (0.08) 23.41 (5.70) 15.59 (5.70) 

7. 1.00 (0.19) 0.67 (0.14) 9.74 (1.40) 3.52 (1.35) 



a = 4 

3.93 (0.56) 5.42 (0.95) 28.24 (5.06) 22.84 (5.20) 
5.43 (0.57) 5.35 (0.81) 11.70 (2.76) 7.54 (2.70) 
1.41 (0.26) 1.49 (0.31) 51.24 (8.52) 43.49 (8.51) 
1.72 (0.31) 1.55 (0.31) 28.61 (6.66) 21.22 (6.62) 
3.90 (0.50) 3.23 (0.60) 9.14 (1.99) 5.07 (1.92) 
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factor (a protein which regulates gene expression). In [1[ a collection of various 
gene expression time-course experiments and a set of candidate motifs for yeast 
is provided. Gene expression values for a total of 2587 genes are available and 
p = 666 motif candidates are used to build the motif scores for each gene. These 
measure how well the motifs are represented in the upstream regions of the 
genes. We focus on a time-course experiment spanning N — 12 different time- 
points. In summary we have 2587 observations of a 666 dimensional predictor 
(the motif scores) and a one-dimensional response (the gene expression value) at 
each of the 12 time-points. Thus, each row of the design matrix X corresponds 
to a gene and each column to a motif score. The element Xij measures how well 
the jth motif score is represented in the upstream region of the ith gene. 

To illustrate the smoothing methods and the effect of different sizes for the 
training set, we use random subsets of different sizes as training set. An in- 
dependent validation set is used to determine the prediction optimal tuning 
parameters. The size of the validation set is half the size of the training set. The 
remaining data is used as test-set. 

The results for a training set of size 1300 is given in Table [3] In terms of 
prediction error, there is not much gain when smoothing the estimators for this 
data-set, especially for the Adaptive Lasso. A reason for this may be the large 
variance of the error term. Note that for a new test observation (xt es t,ytest) we 
have 

E*t„ t ,i, t „t l(y - Vtestf] = E[(x T /3 - x T [3f] + a 2 . 

The error variance a 2 is likely to be the dominating quantity since motif re- 
gression is known to be very noisy. In terms of variable selection, the smooth- 

Table 3 

Mean squared prediction error (top) and number of selected variables (bottom) for the 
training set of size 1300. Rows: 4 different methods, as described in Table\7\ Columns: 12 

different time-points. 





1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


1. 


0.07 


0.13 


0.44 


2.71 


1.52 


1.66 


1.98 


2.55 


3.17 


2.77 


3.02 


2.92 


3. 


0.07 


0.13 


0.45 


2.82 


1.59 


1.75 


2.05 


2.64 


3.26 


2.81 


3.09 


3.04 


4. 


0.07 


0.13 


0.44 


2.71 


1.52 


1.66 


1.99 


2.52 


3.17 


2.78 


3.02 


2.91 


7. 


0.07 


0.13 


0.45 


2.86 


1.59 


1.76 


2.06 


2.65 


3.27 


2.85 


3.10 


3.06 


1. 


35 


28 


74 


155 


133 


157 


154 


141 


124 


106 


123 


109 


3. 


10 


6 


29 


80 


85 


35 


47 


31 


40 


30 


42 


49 


4. 


35 


29 


70 


155 


123 


157 


177 


178 


124 


98 


126 


112 


7. 


4 


5 


21 


71 


46 


29 


34 


22 


29 


22 


35 


43 



ing step decreases the model size for the Adaptive Lasso estimator and is po- 
tentially reducing the number of false positives: In particular for time-points 
t r = 1, 3, 5, 7, 8, 9, 10 the Smoothed Adaptive Lasso yields much sparser model 
fits. For the Lasso estimator, the smoothing has a tendency to increase the num- 
ber of selected variables resulting in rather large models. This coincides with 
our findings in Section [4l If we decrease the training sample size to 200 we see 
some small improvement with respect to the mean squared prediction error (not 
shown) . 
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6. Discussion 

We propose smoothing techniques for ^i-penalized (Lasso-type) estimators for 
a time-course of high-dimensional linear models. We show theoretically that for 
the Lasso and the Adaptive Lasso, better estimates in terms of the mean squared 
error can be obtained by combining the responses of different time-points in a 
suitably weighted way. Empirically, the Smoothed Adaptive Lasso estimator 
yields the sparsest models with competitive mean squared error performance 
when using the univariate Adaptive Lasso as initial estimator. The Smoothed 
Lasso estimator has very good performance with respect to the mean squared 
error but selects too many noise variables in general. An additional thresholding 
stage would be necessary if the primary interest is in variable selection. 

The smoothing methodology can also be applied to generalized linear models 
(GLM). The main difference is that we can't rewrite the smoothed estimator 
as an ordinary lasso problem as in (|2.2|) . This implies that the computational 
burden increases: In the worst case (depending on the support of the kernel and 
the bandwidth h), by stacking the response variables and design matrices of the 
different time-points, the total sample size is Nn, which can be substantially 
larger than n in (]2 . 2f) . while the dimensionality is still p. 

Our methodology applies to more general problems than time-course settings. 
For example, we can directly treat the situation of different (heterogeneous) 
data-sets (y{t), X), t = 1, . . . , N (or (y(t),X(t)), t = 1,...,N) with n(t) x 1 
response vectors and n(t) xp design matrices, where t is the index for the various 
data-sets. All we need is a suitable pseudo-distance d(t, s) among the different 
data-sets indexed by t and s. The weights in (|2.2p are then of the form 



The pseudo-distance d(-, •) could be learned from the data, e.g. based on some 
pseudo-metrics for clustering different data-sets. 

Whether the multivariate view over different time-points (or different data- 
sets) pays off for a particular problem is not clear a-priori. However, our method- 
ology encompasses the univariate Lasso methods, by choosing the bandwidth 
h = 0. Hence, using some cross-validation scheme enables to find out whether 
pooling information over different time-points (or data-sets) is worthwhile and 
if so, the Smoothed (Adaptive) Lasso from the multivariate approach renders 
more accurate estimates. 
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Appendix A: Proofs 

A.l. Regularity Conditions 

We denote by f3(t) £ W the true underlying parameter vector as a function of 
time t. 

(RC1) Curvature of underlying function 

Pj(t) is twice continuously differentiable with sup,,- 

all t and some constant C. 
(RC2) Equidistant grid 

For the asymptotic implications we assume that we have an equidistant 
grid around the time-point of interest t r of the form 

s 

ts tr ~t~ 7 

where s = — \_N/2\ , . . . , [N/2\ . Note that we enumerate using negative 
values of s as well. 

(RC3) Sampling Points 

For the time-point of interest t r we assume that if 0j(t r ) = it follows 
that there is an open neighbourhood Uj 3 t r , such that /3j (u) = 0Vu€ 
Uj. Moreover, we require infj diam([/j) > S for some 5 > 0. I.e. for the 
time-point of interest no variable enters or leaves the active set. 

(RC4) Compact kernel 

The kernel function K (•) is assumed to have compact support on [—1, 1]. 



< C < oo for 



A. 2. Proofs 

Proof of Proposition [2TT1 

As we focus on a single time-point t r we omit the time index for notational 
simplicity, whenever possible. For the smoothed response y we have the model 

y = Xf3 + e N , 

at the time-point of interest, where 

[Nh] 

0= w(t s ,t r )0(t s ) 

s=-lNh} 

and 

[Nh] 
s = -[Nhl 

Note that (£at)i, . . . , (£N)n are i.i.d. with mean zero and variance given below. 
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We can now use the decomposition 

< 2||/3 + 2||/3- 



II- 



(A.l) 



The hrst term is "classical". We can use the theory of [j| with respect to an 
error term with reduced variance. For the asymptotic variance we have 

/ [Nh\ 

Var((£jv)i) = Var w (t s ,t r ) e (t s ) i 

\s=-lNh\ 
LJVfcJ 



s=-[7V/ l J 

1 x2 |A^J 



/J_^^LJV/ij K 2 s/N_ 

2 \Nh) l^ s =-[Nh\ \ h 

f 1 ^[Nhj K (s/N\Y 

\Nh L-, s =-iNh\ \ /i yj 
Using a Riemann sum approximation, we arrive at 

Var((%) 1 ) ~ ^j- I K 2 (x)dx, 



(A.2) 



i.e. the error variance is of order l/(Nh). 

Let us now consider the bias term. If (3i(t r ) = it follows with the com- 
pactness assumption of the kernel and (R-CEJ) that for h — h n small enough 
j3 t (t r ) = 0. If 0i(t r ) ^ we have 



[Nh\ 



w(t„tr)Pi (?r + Jj) 

l Nh i ( 1 s 2 1 

W(t s ,t r ) i fa{t r ) + p' i {t r )jz + 2^^Jp \ 

_ i at h i ^ ~ y 



s=-\Nh\ 



YNh\ 



s=-[_Nh\ 



2 

iV 2 "' 



where |r s — t r \ < 4>. 
Hence, by (RCJTJ), 



/8j(*r) - A(*r) <^~ C J2 w(t s ,t r )=Ch 2 

g=-[iV7iJ 



[Nh\ 



Therefore 



\\(3~[3\\ 2 <\A n \C 2 h 4 



(A.3) 
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for h — h n small enough. 

If we choose h n x log(rt) 1 / 5 n~ 1 / 5 iV _1 / 5 , all terms in (|A.1[) are of the same 
order. 

□ 



Proof of Proposition [2721 

We use the decomposition in (|A.1[) . Since the variance in the smoothed case is 
of order l/(Nh), see l(A~2|) . we obtain 

W0-0WI <0 P ((Nh)-^ 2 a n ). (A.4) 
On the other hand, we have by (|A.3[) 

\\P-(3\\l^\A n \h\ (A.5) 
The optimal rate for the bandwidth minimizing the terms in (|A.4|) and (|A.5[) is 

h opt = N^ 9 \A n \- 2/9 a 2 J 9 -> oo, n oo 
and we obtain using (|A.ll) . (|A.4j) and (|A.5|) 

III - (3f 2 < Op((Nh opt )-^ 2 a n ). (A.6) 

Since 

Nh opt X iV«/ 9 |A,r 2/9 a 2 J 9 -> oo, 7i oo 

because 7V„ ^ |.4 n | 1/ ' 4 a^ 1 ^ 4 , we see from (|A.6|) that a faster convergence rate 
op(a„) is achieved. 

□ 



Proof of Theorem [3711 

As in the proof of Proposition 12. li we have the model 

y = X(3 + e N 

for the smoothed response y. Multiplying both sides with y/Nh results in 

§ = X0 + i N , (A.7) 

with y = V Nhy, X = V NhX and £n = V NMn- 

Note that the variance of the error term en depends on N. As can be seen 
from (1A.2|) . we have for N — > oo 

Var((Ijv).) -cr 2 / if 2 (x)da;. 



Using the rescaled model (IA.7I) , we can now adapt the proof of [24 1. 
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Let us first focus on the problem on the original scale. We re-parameterize 
the parameter vector (3 as 

P = P+^^, 
v^JVh 



or u = VnNh(P — P) 6 MP. The quantity of interest is u = v nNh((3 — P), where 

u = argmin-0 n (u), 

u 

with 



II 



x i ( h 



3 = 1 



y/nNh 



i=i 



By multiplying tp n (u) with iVTi, we can rewrite u = axgxnhxip n (u), where 



V^NhJ 



V^Nh 



and A„ = NhX n - Now we can follow the proof of [2_4j. With slight changes, 
because of the non-constant variance of the error-term, we arrive at 

V^Nh0 A - Pa) 4 N(0, ^(Oaa)' 1 ), 

where A is the active set of the unsmoothed parameter vector, i.e. the parameter 
vector at the current time-point. Finally, observe that 



Vm(Pa - Pa) = V^Nh(p A - Pa) + V^Nh{p A - Pa), 

and that we get for the second term analogously as in (|A.3|) . using \A n | < P < oo, 

nNh \\Pa~ Pa Ha < CnNh 5 

for some constant C. If we choose h — o(n _1 / 5 A r_1 / 5 ), the asymptotic normality 
part follows. 



The proof of model selection consistency is analogous to [24 1 . 



□ 
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